Incorporating preoperative frailty to assist in early prediction of postoperative pneumonia in elderly patients with hip fractures: an externally validated online interpretable machine learning model

Background This study aimed to develop a validated prediction model and an application medium for postoperative pneumonia (POP) in elderly patients with hip fractures, in order to facilitate individualized intervention by clinicians. Methods Using clinical data from elderly patients with hip fractures, we derived and externally validated machine learning models for predicting POP. Model derivation used a registry from Nanjing First Hospital, and external validation was performed using data from patients at the Fourth Affiliated Hospital of Nanjing Medical University. The derivation cohort was divided into a training set and a testing set. The least absolute shrinkage and selection operator (LASSO) and multivariable logistic regression were used for feature screening. We compared the performance of the models to select the best one and introduced SHapley Additive exPlanations (SHAP) to interpret it. Results The derivation and validation cohorts comprised 498 and 124 patients, with POP rates of 14.3% and 10.5%, respectively. Among the candidate models, categorical boosting (Catboost) demonstrated the best discrimination, with an AUROC of 0.895 (95% CI: 0.841–0.949) on the training set and 0.835 (95% CI: 0.740–0.930) on the testing set. At external validation, the AUROC was 0.894 (95% CI: 0.821–0.966). The SHAP method showed that CRP, the modified five-item frailty index (mFI-5), and ASA physical status were the top three predictors of POP. Conclusion The model's good early prediction ability, combined with a web-based risk calculator built on the Catboost model, is expected to help distinguish patients at high risk of POP and facilitate timely intervention. Supplementary Information The online version contains supplementary material available at 10.1186/s12877-024-05050-w.

Introduction
As the population ages, the incidence of hip fractures continues to rise, and it has become a global public health concern [1]. Hip fractures can lead to serious consequences, not primarily because of the fracture itself, but because of the accompanying comorbidities and a range of postoperative complications [2]. Among these, postoperative pneumonia (POP) is one of the most common complications, with an incidence ranging from 4.1% to 15.2% [3,4]. Optimizing surgical planning and perioperative management based on preoperative patient status is a promising strategy for early intervention in this complication. It is therefore important to develop a reliable prediction model for the early identification of elderly patients at high risk of POP after hip fracture, so that POP can be prevented and postoperative quality of life improved.
Most current studies have focused on exploring risk factors for POP. The elderly are prone to multi-organ degeneration, and several comorbidities have been suggested to be independently associated with POP, such as diabetes, respiratory disease, and heart disease [5,6]. Patients with multiple comorbidities are often in a frail state, with clinical manifestations of reduced physiological reserves, increased vulnerability to death, and increased susceptibility to stress [7]. It has been shown that frail patients have higher rates of postoperative complications and mortality than non-frail patients in orthopedic surgery [8]. Incorporating frailty assessment into routine clinical practice is expected to improve the management of POP in elderly hip surgery patients, but there is insufficient clinical evidence to support it.
Individual predictors alone rarely achieve the desired predictive power and cannot provide accurate predicted probabilities. A tool is therefore needed that can combine multiple predictors and flexibly capture the correlation between predictors and outcome to achieve precise prediction. Large population-based prediction scores for postoperative pulmonary complications have been developed, but they are not specific to pneumonia as an outcome [9,10]. Zhang et al. [11] and Xiang et al. [12] developed nomograms for predicting POP after hip fracture based on a simplified assessment of the significance of the variables using traditional algorithms. Such nomograms are easy to understand but do not readily capture the complex relationships between variables. Although these two nomograms achieved good predictions, they are not yet suitable for clinical adoption: first, they were neither internally nor externally validated, so the reported performance may be unreliable; second, they lack an online medium for clinical application.
In contrast, the machine learning (ML) approach is considered an advanced statistical approach that, compared with the "simplified" process of traditional methods, can perform "systematic" inference and make full use of the information in the data. Moreover, as the sample size increases, it can learn from updated data and continuously improve predictive performance. There is still a gap in the application of ML to POP prediction after hip fracture.
Therefore, the main objective of this study was to identify independent risk factors for POP after hip fracture in elderly patients and to establish a prediction model based on an ML algorithm to achieve early prediction. In addition, a web risk calculator was built to provide accurate predicted probabilities to aid clinical decision-making.

Study participants
The derivation cohort consisted of patients with hip fractures who underwent surgical treatment at Nanjing First Hospital (China) between March 2019 and April 2021 and were retrospectively analyzed in this study. Clinical data in the validation cohort were collected from the Fourth Affiliated Hospital of Nanjing Medical University between February 2020 and December 2022. The institutional review boards (IRB) of Nanjing First Hospital (Nanjing, Jiangsu, China) and the Fourth Affiliated Hospital of Nanjing Medical University (Nanjing, Jiangsu, China) approved this study in accordance with the Declaration of Helsinki (Protocol code: KY20220621-04-KS-01, 20230322-k106) and waived the written informed consent requirement owing to the retrospective nature of this study. This study did not involve confidential patient information.

Inclusion and exclusion criteria
Patients aged 65 years or older who were admitted to hospital for femoral neck or trochanteric fracture were included in this study if they underwent total hip replacement or hemiarthroplasty. Exclusion criteria were: (1) pathological fractures; (2) multiple fractures or multiple trauma; (3) conservative treatment; (4) pneumonia that occurred before surgery. In addition, some patients, particularly those with a history of hip fracture, were deemed ineligible to participate in this study. Finally, some participants were excluded because of missing data on pretreatment features (missing rate > 10%) or on the clinical outcome.

Data collection
All data were obtained from the Surgical Anesthetic Information System and the Hospital Information System. After a review of the literature and consultation with clinical experts, the final set of preoperatively available variables for inclusion in the analysis was determined, including demographics (e.g., age, gender, body mass index (BMI)), laboratory measurements (e.g., C-reactive protein (CRP), preoperative hemoglobin), disease history (e.g., hypertension, diabetes mellitus), and preoperative events (e.g., type of fracture, preoperative length of stay). Frailty was assessed using the modified five-item frailty index (mFI-5), which is based on five variables provided by the National Surgical Quality Improvement Program (NSQIP) [13]. The five variables are congestive heart failure, chronic obstructive pulmonary disease (COPD), diabetes mellitus, hypertension requiring medication, and non-independent functional status (totally or partially dependent functional status) [14,15]. Each variable that was present was given 1 point, so the score ranged from 0 to 5 points.
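The scoring rule for the mFI-5 described above can be sketched in a few lines. This is a minimal illustration, not the authors' code; the class and field names (`FrailtyItems`, `mfi5_score`) are hypothetical and chosen only for readability.

```python
from dataclasses import dataclass


@dataclass
class FrailtyItems:
    """The five mFI-5 items; True means the condition is present."""
    congestive_heart_failure: bool
    copd: bool
    diabetes_mellitus: bool
    hypertension_on_medication: bool
    dependent_functional_status: bool  # totally or partially dependent


def mfi5_score(items: FrailtyItems) -> int:
    """One point per item present, so the total ranges from 0 to 5."""
    return sum([
        items.congestive_heart_failure,
        items.copd,
        items.diabetes_mellitus,
        items.hypertension_on_medication,
        items.dependent_functional_status,
    ])


# Hypothetical patient with COPD, diabetes, and treated hypertension.
patient = FrailtyItems(False, True, True, True, False)
print(mfi5_score(patient))  # 3
```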

Outcome
The primary outcome was pneumonia during the postoperative period before hospital discharge. The criteria for POP diagnosis were based on the NSQIP [16,17], which required the fulfillment of at least 1 of 2 criteria: (1) the emergence of purulent sputum or a change in the characteristics of sputum; identification of an organism in a blood culture; or pathogen detection in a specimen obtained through transtracheal aspiration, bronchial brushing, or biopsy; or (2) histopathologic evidence of pneumonia. In addition, patients had to meet 1 of the following two criteria: (1) the presence of rales or dullness on percussion during a physical examination of the chest or (2) a chest radiograph demonstrating new or progressive and persistent infiltrates, consolidation, or cavitation.

Statistical analysis
Normally distributed continuous variables were described using the mean and standard deviation and compared using the t-test. Non-normally distributed data were described using the median and interquartile range and compared using the Mann-Whitney U test. Categorical variables were presented as frequencies (percentages) and assessed with the Chi-square or Fisher's exact test, as appropriate. A P-value < 0.05 (2-sided) was considered statistically significant. We performed statistical analysis using IBM SPSS software (version 25.0) and R version 4.2.2.

Data preprocessing
The derivation cohort was divided randomly into two sets: a training set and a testing set, with a ratio of 3:1.
The training set was used to select features, train the models, and tune hyperparameters, while the testing set was used for internal validation to assess the reliability and stability of each model. Missing data are commonly encountered in practice, and we filled missing values using the K-Nearest Neighbor (KNN) method [18]. Specifically, the missing values were filled in using the KNNImputer module from the "sklearn" package, which imputes each missing value from the values of the nearest neighbors, using the optimal number of neighbors. This approach allowed us to retain the integrity of the data and ensured that our analyses were based on the full sample size and complete data. To prevent data leakage, imputation was performed after splitting the derivation cohort into the training set and testing set. In addition, to ensure consistency, after the split all continuous variables were subjected to Z-score normalization, and categorical variables underwent one-hot encoding [19,20]. Python (version 3.10.4) was used for data preprocessing.
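The order of operations described above (split first, then fit the imputer and scaler on the training set only) can be sketched as follows. This is an illustrative sketch on synthetic data, not the authors' pipeline: the number of neighbors, the random seeds, and the stratified split are assumptions, and one-hot encoding of categorical variables is omitted for brevity.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the derivation cohort (assumption: 4 continuous features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (rng.random(200) < 0.15).astype(int)   # roughly 15% positive outcome
X[rng.random(X.shape) < 0.05] = np.nan      # inject about 5% missing values

# 1) Split 3:1 before any preprocessing (stratification is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 2) Fit the KNN imputer on the training set only, then apply it to both sets.
imputer = KNNImputer(n_neighbors=5)         # n_neighbors chosen for illustration
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)

# 3) Z-score normalization with training-set mean and standard deviation.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train_imp)
X_test_std = scaler.transform(X_test_imp)
```

Fitting the imputer and scaler on the training set alone means that no statistic from the testing set influences the values the model is trained on, which is the leakage the text aims to avoid.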

Variable selection
In this study, feature selection was performed on the training set using the least absolute shrinkage and selection operator (LASSO) [21]. The LASSO method uses the hyperparameter lambda (λ) to shrink regression coefficients towards zero during model estimation. This approach excludes many weakly correlated features by setting their coefficients to zero, and we retained the variables with nonzero coefficients for further analysis. The objective of LASSO hyperparameter optimization is to minimize the cost function. Preoperative factors were entered into the LASSO regression model to evaluate the POP risk in patients before surgery. Lambda was selected from a grid of 500 values between 0 and 0.5, and the value that minimized the objective function was identified through 10-fold cross-validation. To reduce the variability that a single 10-fold cross-validation could introduce, this process was repeated 50 times for each LASSO model. We then used the Variance Inflation Factor (VIF) to evaluate multicollinearity among the independent variables retained by LASSO, and factors with VIF > 5 were excluded [22]. Multivariable logistic regression analysis was performed to determine the variables predicting POP, and the results were expressed as odds ratios (OR) and 95% confidence intervals (95% CI). The prediction model was constructed from the variables with statistical significance (P < 0.05). The LASSO was performed with the R package glmnet 4.1-3.
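The VIF screening step can be illustrated with a short NumPy sketch. The text does not state which tool was used to compute the VIF, so the helper below (`variance_inflation_factors`, a hypothetical name) is only one way to do it: each variable is regressed on the others and VIF = 1 / (1 - R²). The data are synthetic.

```python
import numpy as np


def variance_inflation_factors(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the rest."""
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Design matrix: intercept plus all other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs


# Synthetic example: column c is nearly collinear with column a.
rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + 0.1 * rng.normal(size=300)
vifs = variance_inflation_factors(np.column_stack([a, b, c]))
flagged = vifs > 5   # the threshold used in the text
print(vifs.round(1), flagged)
```

In practice, collinear variables are usually removed one at a time and the VIFs recomputed, since dropping one member of a collinear pair typically brings the other back below the threshold.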

Model development
In this study, we used five different ML classifier algorithms to predict POP and evaluated their performance: logistic regression (LR), random forest classifier (RFC), categorical boosting (Catboost), extreme gradient boosting (XGB), and light gradient boosting machine (LGBM) [23,24]. We applied the grid search algorithm and 10-fold cross-validation to optimize the hyperparameters of each model. The grid search approach exhaustively evaluates all possible hyperparameter combinations within a specified range to identify the optimal selection. The 10-fold cross-validation randomly divided the data into ten folds, with nine used for training and one for validation in turn, to evaluate the model's performance thoroughly. Class imbalance was handled by setting the weight of each class to the inverse of its prevalence [25]. The "sklearn 1.0.2", "xgboost 1.1.1," and "xgboost 1.5.1" packages in Python were used to construct all ML models.
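The inverse-prevalence weighting mentioned above can be made concrete with a small sketch. The function name is hypothetical, and the counts below use the whole derivation cohort (71 POP cases among 498 patients) purely for illustration; in the study the weights would be derived from the training set.

```python
from collections import Counter


def inverse_prevalence_weights(labels):
    """Weight each class by 1 / prevalence, i.e. n_total / n_class."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: n / count for cls, count in counts.items()}


# Illustration with the derivation cohort counts: 71 POP, 427 non-POP.
labels = [1] * 71 + [0] * 427
weights = inverse_prevalence_weights(labels)
print(weights)  # the minority class (POP) receives the larger weight
```

With these weights, each class contributes the same total weight to the loss (71 × 498/71 = 427 × 498/427 = 498), so errors on the rarer POP cases are not swamped by the majority class. Most libraries accept such a mapping through a class-weight style parameter, although the exact argument name differs between packages.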

Evaluation and validation
The evaluation of the models involved an internal validation using 10-fold cross-validation within the testing set, which aimed to assess the stability of the models. Following this, external validation was carried out to evaluate the generalization capability of the models. The area under the receiver operating characteristic curve (AUROC) and its 95% CI were used as the primary metric of discriminatory power. An AUROC of 0.5 indicates random guessing, while an AUROC of 1.0 indicates perfect classification; a higher AUROC indicates that the model is better at distinguishing between positive and negative cases. The Delong test was used to assess the statistical difference between pairs of AUROCs among the five models [26]. The optimal threshold of the predicted probability was selected from the receiver operating characteristic (ROC) curve, and confusion-matrix metrics such as sensitivity, specificity, accuracy, and F1 value were used to evaluate the risk stratification ability of the models. Additionally, the area under the precision-recall curve (AUPRC) was used to quantify model performance, specifically the trade-off between precision and recall at different thresholds of the model's output score. A higher AUPRC indicates a better precision-recall trade-off, meaning that the model identifies positive cases effectively while minimizing false positives.
Model calibration was evaluated graphically by plotting the predicted probabilities against the observed outcomes. The calibration intercept and slope can be derived from this plot; their ideal values are 0 and 1, respectively. The Brier score was also used to measure the accuracy of the predicted probabilities of each model; a value of 0 indicates perfect prediction, while 1 indicates the worst possible prediction. Based on these performance metrics, we selected the best model.
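To make the two headline metrics concrete, the following dependency-free sketch computes the AUROC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, and the Brier score as the mean squared difference between predicted probability and outcome. The labels and probabilities are hypothetical toy values, not study data.

```python
def auroc(y_true, y_score):
    """Fraction of positive/negative pairs ranked correctly (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared error of predicted probabilities; 0 is perfect."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical toy predictions for eight patients (1 = POP).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.40, 0.35, 0.15, 0.80, 0.70]

print(round(auroc(y_true, y_prob), 3))        # 0.933 (14 of 15 pairs ranked correctly)
print(round(brier_score(y_true, y_prob), 4))  # 0.0984
```

The pairwise formulation is quadratic in the number of patients and is meant only to expose the definition; library implementations compute the same quantity from sorted scores.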

Model interpretation
SHapley Additive exPlanations (SHAP) values, which use a game-theoretic approach to explain the output of ML models, were calculated using the "SHAP 0.40.0" package in Python [27]. These values provide a metric for assessing the importance of a feature relative to other features, taking into account how that feature affects the loss function. Moreover, the Shapley values indicate the direction of the relationship between each feature and the target. The mean absolute Shapley values were used to quantify SHAP feature importance. The SHAP bar plot shows which features influence the model's prediction most, whereas the SHAP scatter plot helps identify whether a variable correlates positively with the outcome.
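The ranking behind a SHAP bar plot, the mean absolute Shapley value per feature, can be sketched without the SHAP package once per-patient SHAP values are available. The matrix below is entirely hypothetical (three features, four patients) and serves only to show the computation; `mean_abs_shap` is an illustrative helper name.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by the mean absolute SHAP value across patients."""
    n = len(shap_rows)
    importance = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical SHAP values: one row per patient, one column per feature.
features = ["CRP", "mFI-5", "age"]
shap_rows = [
    [0.30, 0.10, -0.05],
    [-0.20, 0.25, 0.05],
    [0.40, -0.10, 0.10],
    [-0.10, 0.15, -0.05],
]

ranking = mean_abs_shap(shap_rows, features)
for name, value in ranking:
    print(f"{name}: {value:.4f}")
```

Taking the absolute value before averaging matters: a feature that pushes some predictions up and others down would otherwise appear unimportant because its signed contributions cancel.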

Patient characteristics
From March 2019 to April 2021, 498 eligible patients were included in the derivation cohort (Fig. 1). The demographic and clinical characteristics of these patients on admission are described in Table 1. Among them, 447 and 51 had been diagnosed with femoral neck and trochanteric fractures, respectively. Furthermore, 71 patients (14.3%) were diagnosed with POP. Patients with POP were older than those without POP (P < 0.001), and there was no statistical difference in gender or BMI. CRP and mFI-5 differed between the two groups (P = 0.007 and P < 0.001). Chronic obstructive pulmonary disease, heart failure, smoking, preoperative peripheral oxygen saturation (SpO2), ASA physical status, and preoperative length of stay also differed between patients with and without POP (P < 0.05). These patients were randomly assigned to a training set (n = 373) or a testing set (n = 125), with pneumonia incidence rates of 13.6% and 14.5%, respectively. Demographics and clinical characteristics were generally well balanced between the two sets (Supplementary Table S1).
To validate the prediction models from the derivation cohort, an external validation cohort was collected at the Fourth Affiliated Hospital of Nanjing Medical University between February 2020 and December 2022 (Fig. 1). A total of 124 eligible elderly patients were included in the validation cohort using the same inclusion and exclusion criteria as the derivation cohort. Among them, 13 patients (10.5%) were diagnosed with POP. Supplementary Table S2 provides the baseline characteristics of these patients.

Feature selection
A few variables had missing values; the specific percentages are listed in Supplementary Table S1, and the missing values were filled using the KNN method. In the training set, 24 variables were included in the selection procedure. The LASSO identified eight characteristics with non-zero coefficients associated with POP (Supplementary Figure S1): age, CRP, preoperative length of stay, mFI-5, smoking, preoperative SpO2, fracture type, and ASA physical status. There was no collinearity among the eight variables (Supplementary Table S3). Multivariable logistic regression analysis was performed on these eight variables, and seven independent predictors of POP risk were identified: age, CRP, preoperative length of stay, mFI-5, smoking, preoperative SpO2, and ASA physical status (Table 2).

Model performance
We constructed five different ML models, including LR, RFC, Catboost, XGB, and LGBM, and evaluated their performance in predicting POP occurrence. The best hyperparameter combination for each model is provided in Supplementary Table S4. Figure 2 shows their AUROCs and AUPRCs on the training and testing sets. As shown in Fig. 2, on the testing set, the Catboost model yielded the highest AUROC value (median, 0.835; 95% CI: 0.740-0.930) and the highest AUPRC value (median, 0.548; 95% CI: 0.343-0.737). The LGBM model had the next highest AUROC value of 0.754 (95% CI: 0.645-0.864). The XGB model had the next highest AUPRC value (median, 0.390; 95% CI: 0.213-0.601). Based on the Delong test, there were statistical differences in AUROC between the Catboost model and the other models in the testing set (Supplementary Table S5). Additionally, the Youden index of the ROC curve was used to identify the appropriate threshold for each model. The accuracy, sensitivity, specificity, and F1 value of each model at that threshold are shown in Table 3.
The Catboost model achieved the highest accuracy, sensitivity, and F1 value in predicting POP among the ML models on the testing set. The RFC model showed the highest specificity for predicting POP. Notably, the calibration plot indicated that the Catboost model was positioned closer to the diagonal reference line, yielding the lowest Brier score of 0.112 (Fig. 3).
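Threshold selection by the Youden index, as used above, amounts to scanning candidate cut-offs and keeping the one that maximizes J = sensitivity + specificity - 1. The sketch below uses hypothetical toy predictions rather than study data, and it classifies a patient as positive when the predicted probability is at or above the threshold (an assumed convention).

```python
def youden_threshold(y_true, y_prob):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = (None, -1.0)
    for t in sorted(set(y_prob)):
        # Positive prediction when probability >= t.
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        j = tp / n_pos + tn / n_neg - 1
        if j > best[1]:   # ties keep the lowest threshold
            best = (t, j)
    return best


# Hypothetical toy predictions for eight patients (1 = POP).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.40, 0.35, 0.15, 0.80, 0.70]

threshold, j = youden_threshold(y_true, y_prob)
print(threshold, round(j, 3))  # 0.35 0.8
```

At the selected cut-off in this toy example, all three positive cases are detected (sensitivity 1.0) and four of five negatives are correctly ruled out (specificity 0.8); the confusion-matrix metrics reported in the text would then be computed at that single threshold.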

Model interpretation
The contribution of each potential risk factor was visualized and ranked by the SHAP method using the Catboost model (Fig. 5), highlighting the most important features. The results in Fig. 5A demonstrate that CRP, mFI-5, and ASA physical status had the greatest impact on the predicted outcome. Figure 5B is the scatter plot, in which red and blue dots represent higher and lower values of the features, respectively. For mFI-5, the red dots were distributed within the range of positive SHAP values, suggesting that patients with higher scores had a greater risk of developing POP. All predictors were identified as positively correlated with the outcome and considered risk factors. We also applied this approach to the other ML models. As shown in Supplementary Figure S2, preoperative SpO2, preoperative length of stay, and smoking were significant variables among the seven factors for these models, indicating that these variables affected the outcome.

Construction of the web calculator
The Catboost model has been integrated into a web risk calculator, accessible at https://predictionprobability-of-pneumonia.streamlit.app/ (Fig. 6). The web risk calculator could offer clinicians a practical tool to identify high-risk patients for early intervention, or serve as a practical demonstration tool. It also provides research support for the development of medical device software based on ML algorithms.

Discussion
Clinicians are often asked to help with preoperative risk assessment and perioperative medical management. In this study, for the first time, we took full advantage of ML to develop and validate an effective early POP prediction model for elderly hip fracture patients by combining seven routinely obtained preoperative variables. We also built a web risk calculator as a medium for clinical application.
The Catboost model is considered a powerful ML algorithm that can efficiently handle categorical features and take advantage of ensemble learning to achieve highly accurate predictions [28]. Our study demonstrated that the Catboost model achieved a high AUROC of 0.894 (95% CI: 0.821-0.966) and an AUPRC of 0.550 (95% CI: 0.320-0.761) in the external validation set, indicating that it performs well on imbalanced data. This was also reflected in the sensitivity (0.765). High sensitivity is crucial for clinical applicability, as failure to correctly identify patients with POP may have serious consequences. Another advantage of our model is the web risk calculator based on the Catboost algorithm, which anyone can access online. The probability of a patient's risk of POP is output directly after the predictive characteristics are entered, saving the time needed for manual calculation and greatly increasing the ease of clinical application. Moreover, an accurate predicted probability from a complex model needs to be accompanied by an interpretation of that probability. Therefore, we added the corresponding SHAP visual interpretation plots to the calculator output, which show the contribution of each variable to the outcome probability. To some extent, this improved clinicians' acceptance of the model results. In addition, these variables were all readily accessible preoperatively, facilitating early risk assessment and reasonable adjustment of perioperative medical management.
Among the predictive variables, the mFI-5 was simplified from the modified 11-item frailty index (mFI-11), making it easier to use in daily clinical practice. The mFI-5 has been reported to be as effective as the mFI-11 in predicting mortality, postoperative infection, and unplanned 30-day readmission [13]. In a prospective study, frailty was found to influence the susceptibility to and severity of community-acquired pneumonia in elderly patients [29]. In patients with hip fractures, a high mFI-5 was significantly associated with poor functional recovery, total complications, and serious medical complications (e.g., cardiac arrest, myocardial infarction, and septic shock) [30,31]. Elderly patients with a high mFI-11 who underwent abdominal surgery were also confirmed to have a higher risk of postoperative pulmonary complications [32]. The positive association of mFI-5 with the probability of POP in elderly patients with hip fractures can also be seen in our SHAP summary plots. Moreover, although frailty is usually age-related, disease-related frailty still accounts for an important part [33,34]. In these patients, disease or comorbidities are probably the most significant cause of the decline in physiological reserve.
In addition to the non-modifiable factors of mFI-5, age, and ASA physical status, potentially modifiable factors (e.g., CRP, SpO2, and smoking) may be of greater concern. Firstly, preoperative CRP reflects the inflammatory status of the patient. Although it is a nonspecific marker of systemic inflammation, it has been shown to predict postoperative infection (including pneumonia, surgical site infection, and urinary tract infection) and mortality in hip fracture patients [35,36]. Our study further confirmed the predictive role of CRP for POP specifically, rather than for overall postoperative infection, in elderly patients with hip fractures, with its contribution to the prediction model ranked first. Secondly, low preoperative SpO2 increased the risk of POP, which is consistent with the findings of Russotto et al. [37]. SpO2 has also been identified as a predictor of postoperative respiratory failure and postoperative pulmonary complications [38,39]. This simple, non-invasive indicator provides an early warning for patients with low lung function. Clinicians could take measures such as lung function exercises for early intervention in patients with preoperative SpO2 below 96% to reduce the risk of POP [40]. Thirdly, preoperative smoking cessation is strongly recommended for patients who smoke, and guidelines have shown that this preventive measure can reduce perioperative risk, including the occurrence of POP [41,42].
Patients with delayed surgery spend longer on bed rest, which may increase the risk of exposure to pro-inflammatory conditions and reduce the patient's ability to expel sputum, thereby increasing the risk of POP [43,44]. Numerous studies and guidelines recommend that elderly patients with hip fractures receive prompt surgical treatment within 48 h of admission or even earlier [45][46][47]. Our study indicated that preoperative length of stay is positively associated with the risk of POP, which is consistent with most previous studies [48]. However, some patients in poor health on admission may require necessary preoperative examinations and interventions. Balancing the patient's preoperative status against the length of the wait for surgery remains a critical task for clinicians.
This study had some limitations. Firstly, as in many retrospective studies, some information was reported by patients or their family members, which inevitably introduces selection or recall bias. Secondly, the data used for model construction were collected at a single medical center. Although our model was validated in a recent three-year database of elderly hip fractures at another medical institution, the sample size was small and the number of patients with positive outcomes was even smaller, and performance metrics that focus on true positives, such as sensitivity, were calculated from this rather small number of patients. Future validation of our model in larger databases is still needed. Thirdly, this study did not include intraoperative variables or perioperative antibiotic use in the analysis, and how this information affects the occurrence of POP needs to be explored further. However, modeling with preoperative factors only enables early clinical prediction and can guide early intervention. It is also noteworthy that the details of the surgical protocols and medication regimens for the treatment of hip fractures differ between centers, which helps to explain the heterogeneity of results across studies.

Table 4
The performance of the five final models under the optimal threshold for external validation

Conclusion
In this study, CRP, mFI-5, and ASA physical status were the top three predictors of POP. To our knowledge, this was the first study to identify preoperative mFI-5 as an independent risk factor for POP in elderly people with hip fractures. The POP prediction model based on readily available preoperative variables achieved good accuracy and was corroborated by external data. The web risk calculator should facilitate clinical application by identifying high-risk patients for early intervention or specific care.

Fig. 1
Fig. 1 Flow chart of patient enrollment in this study

Fig. 2
Fig. 2 Comparison of AUROC and AUPRC curves among LR, RFC, Catboost, XGB, and LGBM in the training and testing sets. (A) AUROC curves of the training set; (B) AUROC curves of the testing set; (C) AUPRC curves of the training set; (D) AUPRC curves of the testing set. AUROC, the area under the receiver operating characteristic curve; AUPRC, the area under the precision-recall curve; LR, logistic regression; RFC, random forest classifier; Catboost, categorical boosting; XGB, extreme gradient boosting; LGBM, light gradient boosting machine

Fig. 5
Fig. 5 SHAP summary plot for the seven influential variables in the Catboost model. (A) The mean absolute influence of each factor on the magnitude of the model output, presented in descending order of feature importance; (B) The dot estimate of the Catboost model output, with each dot corresponding to a patient in the dataset. Catboost, categorical boosting; mFI-5, modified five-item frailty index; SpO2, peripheral capillary oxygen saturation; ASA, American Society of Anesthesiologists

Fig. 6
Fig. 6 The risk web calculator was designed based on the Catboost model. Catboost, categorical boosting

Table 1
Demographics and potential risk factors of patients in the dataset. BMI, body mass index (calculated as weight in kilograms divided by height in meters squared); CRP, C-reactive protein; Cr, creatinine; mFI-5, modified five-item frailty index; SpO2, peripheral capillary oxygen saturation; ASA, American Society of Anesthesiologists

Table 2
The association of selected variables with pneumonia using multivariable logistic regression in the training set